Goodreads provides an API (and an accompanying R interface), but gathering and processing data through it proved far too difficult, so we will instead use the data set found on Kaggle here.
First we will import the data and take a look at its contents.
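The code in this post assumes the tidyverse is loaded, along with broom (for `tidy()`) and gganimate (for the animated plot near the end). A minimal setup, inferred from the functions used below, would be:

```r
library(tidyverse)  # read_csv, dplyr verbs, stringr, forcats, ggplot2
library(broom)      # tidy() for turning model fits into tibbles
library(gganimate)  # transition_states(), enter_fade(), etc.
```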
books <- read_csv("data/books.csv") %>%
  rename(rating = average_rating,
         total_ratings = ratings_count,
         lang = language_code,
         pages = num_pages) %>%
  na.omit()
books
## # A tibble: 8,469 x 12
## bookID title authors rating isbn isbn13 lang pages total_ratings
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 "Har… J.K. R… 4.57 0439… 97804… eng 652 2095690
## 2 2 "Har… J.K. R… 4.49 0439… 97804… eng 870 2153167
## 3 4 "Har… J.K. R… 4.42 0439… 97804… eng 352 6333
## 4 5 "Har… J.K. R… 4.56 0439… 97804… eng 435 2339585
## 5 8 "Har… J.K. R… 4.78 0439… 97804… eng 2690 41428
## 6 9 "Una… W. Fre… 3.74 0976… 97809… en-US 152 19
## 7 10 "Har… J.K. R… 4.73 0439… 97804… eng 3342 28242
## 8 12 "The… Dougla… 4.38 0517… 97805… eng 815 3628
## 9 13 "The… Dougla… 4.38 0345… 97803… eng 815 249558
## 10 14 "The… Dougla… 4.22 1400… 97814… eng 215 4930
## # … with 8,459 more rows, and 3 more variables: text_reviews_count <dbl>,
## # publication_date <chr>, publisher <chr>
So we are working with 12 variables and 8,469 observations. Some of these variables are redundant and, for our purposes, can be ignored.
books$bookID <- NULL
books$isbn <- NULL
books$isbn13 <- NULL
Since we speak English, we'd like to focus solely on English-language books. But first, let's see how the data breaks down by language.
books %>%
  subset(is.na(as.numeric(lang))) %>%   # drop codes that parse as numbers (bad entries)
  subset(!is.na(lang)) %>%
  group_by(lang) %>%
  tally() %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ggplot(aes(x = factor(lang, levels = lang), y = n, fill = lang)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of books by language",
       x = "language",
       y = "number of books") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
The overwhelming majority of the data is in `eng`, so we don't lose much by ignoring the other publications. However, it looks like the US and British editions of books fall under different language tags (and, if we search hard enough, Canadian too). For our purposes `en-CA`, `en-GB`, and `en-US` are the same as `eng`, so let us clean up the data a little.
books <- books %>%
  mutate(lang = str_replace_all(lang, "en(g|-GB|-US|-CA)", "eng"))
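As a quick sanity check, the substitution can be tried on a few sample codes using base `gsub` with an equivalent pattern (the vector of codes here is made up for illustration):

```r
codes <- c("eng", "en-US", "en-GB", "en-CA", "spa", "fre")
gsub("en(g|-GB|-US|-CA)", "eng", codes)
#> "eng" "eng" "eng" "eng" "spa" "fre"
```

The three regional variants collapse into `eng`, while unrelated codes pass through untouched.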
# keep English-language books with a rating of at least 2.5
eng_books <- books %>% filter(lang == "eng" & rating >= 2.5)
Next let’s get an idea of the distribution of ratings for English language books.
eng_books %>%
  group_by(rating) %>%
  summarize(total_books = n()) %>%
  ungroup() %>%
  ggplot(aes(x = rating, y = total_books, fill = rating)) +
  geom_col() +
  labs(title = "Total number of books per rating",
       x = "average rating",
       y = "number of books")
And what about publications per year?
eng_books <- eng_books %>%
  mutate(publication_year = as.numeric(str_sub(publication_date, -4)))

eng_books %>%
  group_by(publication_year) %>%
  summarize(total_books = n()) %>%
  ungroup() %>%
  ggplot(aes(x = publication_year, y = total_books)) +
  geom_point() +
  xlim(1900, 2020) +
  labs(x = "publication year",
       y = "books published")
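The year extraction above leans on the fact that `publication_date` ends in a four-digit year. In base R the same idea looks like this (the sample date is hypothetical):

```r
date <- "9/16/2006"   # hypothetical publication_date value
as.numeric(substr(date, nchar(date) - 3, nchar(date)))
#> 2006
```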
The counts grow roughly exponentially up to about 2006 and then fall off sharply (presumably reflecting when the data set was collected), so we fit separate log-linear models to the two regimes.
books_per_year <- eng_books %>%
  group_by(publication_year) %>%
  summarize(total_books = n()) %>%
  arrange(desc(total_books)) %>%
  ungroup()

# log-linear fit to the growth regime, up to 2006
model_publish_books <- books_per_year %>%
  filter(publication_year <= 2006) %>%
  lm(log(total_books) ~ publication_year, .)

# log-linear fit to the drop-off after 2006
filtered_model <- books_per_year %>%
  filter(publication_year > 2006) %>%
  lm(log(total_books) ~ publication_year, .)
model_publish_books %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -159. 9.26 -17.1 8.62e-26
## 2 publication_year 0.0818 0.00470 17.4 3.47e-26
filtered_model %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 690. 72.3 9.54 0.000000593
## 2 publication_year -0.341 0.0359 -9.51 0.000000615
# the model was fit on log counts, so exponentiate to get predicted book counts
estimate_2020 <-
  exp(predict(model_publish_books,
              newdata = data.frame(publication_year = seq(2015, 2020))))
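Because both models are fit on log counts, the slopes translate into multiplicative yearly growth rates. Taking the estimates from the two `tidy()` tables above:

```r
exp(0.0818)   # ≈ 1.085: roughly 8.5% more books per year before 2006
exp(-0.341)   # ≈ 0.711: counts shrink by roughly 29% per year after 2006
```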
Who are the top rated authors? Naively we can look at the average rating of each author as follows.
eng_books %>%
  group_by(authors) %>%
  summarize(average_rating = mean(rating)) %>%
  arrange(desc(average_rating)) %>%
  ungroup() %>%
  slice(1:10)
## # A tibble: 10 x 2
## authors average_rating
## <chr> <dbl>
## 1 Chris Green/Chris Wright/Paul Douglas Gardner 5
## 2 James E. Campbell 5
## 3 Julie Sylvester/David Sylvester 5
## 4 Keith Donohue 5
## 5 Laura Driscoll/Alisa Klayman-Grodsky/Eric Weiner 5
## 6 Middlesex Borough Heritage Committee 5
## 7 Ross Garnaut 5
## 8 Sheri Rose Shepherd 5
## 9 Todd Davis/Marc Frey 5
## 10 William C. Dowling 5
Who are these people? Highly rated authors are not necessarily notable ones; perhaps the sum total of an author's ratings is a better indicator of a good author.
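A toy example (with made-up numbers) shows why: an obscure author with a single 5.0-rated book beats a prolific author averaging 4.5 over twenty books on mean rating, but not on summed rating:

```r
one_hit  <- c(5.0)         # one perfect rating
prolific <- rep(4.5, 20)   # twenty very good ratings
mean(one_hit) > mean(prolific)   # TRUE:  5.0 beats 4.5 on average
sum(one_hit)  > sum(prolific)    # FALSE: 5 vs 90 in total
```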
prolific_writer_names <- eng_books %>%
  group_by(authors) %>%
  summarize(book_count = n(),
            sum_rating = sum(rating)) %>%
  arrange(desc(sum_rating)) %>%
  slice(1:10) %>%
  pull(authors)
prolific_writers <- eng_books %>% filter(authors %in% prolific_writer_names)
prolific_writers %>%
  mutate(authors = str_replace_all(authors,
                                   "Margaret Weis/Tracy Hickman",
                                   "Weis/Hickman")) %>%   # shorten a long co-author name
  group_by(authors) %>%
  summarize(book_count = n(),
            sum_rating = sum(rating)) %>%
  arrange(desc(sum_rating)) %>%
  slice(1:10) %>%
  ggplot(aes(factor(authors, levels = authors), sum_rating, fill = authors)) +
  geom_col() +
  coord_flip() +
  labs(title = "Author rankings",
       x = "author",
       y = "sum rating") +
  scale_fill_brewer(palette = "Spectral")
OK, this is much better and most of them are recognizable. Now let’s take a look at their average rating.
prolific_writers %>%
  mutate(authors = str_replace_all(authors,
                                   "Margaret Weis/Tracy Hickman",
                                   "Weis/Hickman")) %>%
  group_by(authors) %>%
  summarize(avg_rat = mean(rating)) %>%
  ungroup() %>%
  ggplot(aes(x = authors, y = avg_rat, fill = authors)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 average author rating",
       x = "author",
       y = "average rating") +
  theme(axis.text.x = element_text(angle = 90))
So Stephen King is the most prolific writer in this data set and also has fairly good ratings. In this section we will look at models for predicting the popularity of a hypothetical future publication by Stephen King.
stephen_king <- books %>%
  filter(authors == "Stephen King" &
         title != "Blood and Smoke" &
         title != "LT's Theory of Pets")

sk_model <- lm(rating ~ pages, stephen_king)
sk_model %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.62 0.111 32.7 1.03e-26
## 2 pages 0.000795 0.000217 3.67 8.61e- 4
predict(sk_model,
newdata = data.frame(pages = c(300)))
## 1
## 3.855764
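We can sanity-check this prediction by hand from the (rounded) coefficients in the `tidy()` table above; it agrees with `predict()` up to coefficient rounding:

```r
3.62 + 0.000795 * 300   # ≈ 3.86, close to predict()'s 3.855764
```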
stephen_king %>% summarize(mean = mean(pages))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 478.
Next we will look at the top 10 publishers by number of books published. We first need to turn the publisher name into a factor to work with it more easily.
eng_books$publisher <- factor(eng_books$publisher)
Next let us see who the top 10 publishers are.
top_10_publishers <- eng_books %>%
  group_by(publisher) %>%
  summarize(total_books = n()) %>%
  arrange(desc(total_books)) %>%
  slice(1:10) %>%
  pull(publisher)

eng_books_tp <- eng_books %>%
  filter(publisher %in% top_10_publishers)
Next let us see how the popularity of their publications has changed between 2000 and 2007.
eng_books_tp %>%
  filter(publication_year >= 2000 &
         publication_year <= 2007) %>%
  ggplot(aes(x = fct_reorder(publisher, rating),
             y = rating, color = publisher)) +
  geom_boxplot() +
  labs(title = "Publisher rating in {closest_state}",
       x = "publisher") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  transition_states(publication_year,
                    transition_length = 2,
                    state_length = 4) +
  enter_fade() +
  exit_shrink() +
  ease_aes('sine-in-out')
Two publishers stand out from the rest: VIZ Media LLC and Harper Collins. VIZ Media publishes graphic novels and comic books, which have a higher average rating than traditional books.